Runbook: Certificate Expiry
Alert
- Prometheus Alert: CertManagerCertExpirySoon/CertManagerCertNotReady
- Grafana Dashboard: cert-manager dashboard
- Firing condition: Certificate will expire within 30 days, or Certificate resource is in a not-ready state
Severity
Critical -- Expired certificates cause TLS failures across the platform, breaking ingress traffic, inter-service communication, and webhook connectivity.
Impact
- Istio ingress gateway stops accepting HTTPS connections
- Webhook services (Kyverno, cert-manager) fail validation/mutation
- Internal mTLS certificates may fail rotation, degrading mesh communication
- Harbor, Grafana, Keycloak UI access via Istio gateway breaks
Investigation Steps
- List all certificates and their status:
kubectl get certificates -A
- Check for certificates that are not ready or near expiry:
kubectl get certificates -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,EXPIRY:.status.notAfter,RENEWAL:.status.renewalTime'
- Describe the failing certificate for detailed condition messages:
kubectl describe certificate <certificate-name> -n <namespace>
- Check cert-manager controller logs for errors:
kubectl logs -n cert-manager deployment/cert-manager --tail=100
- Check the CertificateRequest resources associated with the failing certificate:
kubectl get certificaterequest -n <namespace>
kubectl describe certificaterequest <name> -n <namespace>
- Check the Order and Challenge resources (for ACME/Let's Encrypt issuers):
kubectl get orders -A
kubectl get challenges -A
- Verify the ClusterIssuer or Issuer is ready:
kubectl get clusterissuers
kubectl describe clusterissuer <issuer-name>
- Check cert-manager webhook health:
kubectl get pods -n cert-manager
kubectl logs -n cert-manager deployment/cert-manager-webhook --tail=50
- Check the cert-manager HelmRelease status:
flux get helmrelease cert-manager -n cert-manager
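The expiry check in the steps above can be scripted. The following is a minimal sketch rather than an established part of this runbook: `expiring_within` is a hypothetical helper name, and it assumes bash and GNU `date` (for `date -d`). It reads `namespace name notAfter` lines on stdin and prints certificates that have expired or will expire within the given number of days.

```shell
#!/usr/bin/env bash
# Sketch: flag certificates that are expired or expire within N days.
# Input (stdin): "<namespace> <name> <notAfter>" per line, e.g. from:
#   kubectl get certificates -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,EXPIRY:.status.notAfter' \
#     | expiring_within 30
# Assumption: GNU date is available (on macOS, gdate from coreutils).

expiring_within() {
  local days="$1"
  local now="${2:-$(date +%s)}"   # optional fixed "now" in epoch seconds
  local ns name not_after expiry remaining

  while read -r ns name not_after; do
    # Certificates that were never issued have no notAfter ("<none>").
    if [ -z "$not_after" ] || [ "$not_after" = "<none>" ]; then
      continue
    fi
    # Skip lines whose timestamp cannot be parsed.
    expiry=$(date -u -d "$not_after" +%s 2>/dev/null) || continue
    remaining=$(( (expiry - now) / 86400 ))
    if [ "$expiry" -le "$now" ]; then
      printf '%s/%s EXPIRED\n' "$ns" "$name"
    elif [ "$remaining" -le "$days" ]; then
      printf '%s/%s expires in %d days\n' "$ns" "$name" "$remaining"
    fi
  done
}
```

The second argument exists only so the function can be exercised against a fixed point in time; in normal use it can be omitted.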
Resolution
Certificate stuck in not-ready state
- Delete the failing CertificateRequest to trigger a new one:
kubectl delete certificaterequest <name> -n <namespace>
- If no new CertificateRequest appears, trigger re-issuance explicitly with cmctl (assuming the cmctl CLI is installed):
cmctl renew <name> -n <namespace>
- Note that the cert-manager.io/issue-temporary-certificate annotation does not force re-issuance. It only asks cert-manager to place a temporary certificate in the Secret while the real one is pending, so toggling it is not a substitute for the step above.
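After triggering re-issuance, it can help to confirm that the Certificate actually reaches the Ready condition instead of polling by hand. A minimal sketch, assuming bash and a kubectl context pointing at the affected cluster; `wait_for_certificate` is a hypothetical helper name and the 120s default timeout is an arbitrary choice:

```shell
#!/usr/bin/env bash
# Sketch: block until a Certificate reports Ready, or show its requests on timeout.
# Usage: wait_for_certificate <certificate-name> <namespace> [timeout]

wait_for_certificate() {
  local name="$1" ns="$2" timeout="${3:-120s}"

  if kubectl wait --for=condition=Ready "certificate/${name}" \
       -n "$ns" --timeout="$timeout" >/dev/null 2>&1; then
    echo "certificate ${ns}/${name} is Ready"
  else
    echo "certificate ${ns}/${name} still not Ready after ${timeout}" >&2
    # List the CertificateRequests so the next failing one is easy to find.
    kubectl get certificaterequest -n "$ns" >&2
    return 1
  fi
}
```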
ClusterIssuer not ready (self-signed CA)
- Check the CA secret exists:
kubectl get secret -n cert-manager | grep ca
- If the root CA secret is missing, recreate it. Check the ClusterIssuer spec for the expected secret name:
kubectl describe clusterissuer sre-ca-issuer
- Re-apply the cert-manager manifests via Flux:
flux reconcile helmrelease cert-manager -n cert-manager
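The secret lookup above can be made less error-prone by reading the expected name from the ClusterIssuer instead of grepping. A minimal sketch under stated assumptions: the issuer is a CA issuer (so the name lives in `.spec.ca.secretName`), and the secret is stored in the `cert-manager` namespace, which is the default cluster resource namespace but may differ in this installation. `check_ca_secret` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: verify that the CA secret referenced by a ClusterIssuer exists.
# Usage: check_ca_secret [issuer-name] [secret-namespace]
# Return codes: 0 = present, 1 = missing, 2 = issuer has no CA secret reference.

check_ca_secret() {
  local issuer="${1:-sre-ca-issuer}" ns="${2:-cert-manager}" secret

  secret=$(kubectl get clusterissuer "$issuer" -o jsonpath='{.spec.ca.secretName}')
  if [ -z "$secret" ]; then
    echo "ClusterIssuer ${issuer} has no spec.ca.secretName" >&2
    return 2
  fi

  if kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
    echo "CA secret ${ns}/${secret} exists"
  else
    echo "CA secret ${ns}/${secret} is missing" >&2
    return 1
  fi
}
```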
ACME challenge failure (Let's Encrypt)
- Check challenge status:
kubectl describe challenge <name> -n <namespace>
- Verify DNS is resolving correctly for the domain
- Verify the HTTP-01 solver can reach the challenge endpoint (check Istio gateway and VirtualService)
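The HTTP-01 reachability check can be approximated from outside the cluster by requesting the well-known challenge path. A minimal sketch, assuming bash and curl; `challenge_url` and `probe_challenge` are hypothetical helper names. The domain and token can usually be read from the Challenge resource (its `.spec.dnsName` and `.spec.token` fields), and the solver only answers while the challenge is still pending.

```shell
#!/usr/bin/env bash
# Sketch: probe the ACME HTTP-01 endpoint for a pending challenge.
# Usage: probe_challenge <domain> <token>

# Build the standard HTTP-01 challenge URL for a domain and token.
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

probe_challenge() {
  local domain="$1" token="$2" code
  # curl prints 000 as the status code when the connection itself fails.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
           "$(challenge_url "$domain" "$token")")
  if [ "$code" = "200" ]; then
    echo "challenge endpoint reachable (HTTP ${code})"
  else
    echo "challenge endpoint NOT reachable (HTTP ${code})" >&2
    return 1
  fi
}
```

A non-200 result points at routing (Istio gateway or VirtualService) or DNS rather than at cert-manager itself.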
Manual certificate renewal
- Delete the existing secret to force re-issuance:
kubectl delete secret <tls-secret-name> -n <namespace>
- cert-manager will detect the missing secret and re-issue automatically
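Because workloads that mount the secret have no certificate until re-issuance completes, it may be worth saving a copy before deleting it. A minimal sketch, assuming bash; `backup_and_delete_secret` is a hypothetical helper name and the default backup filename is an arbitrary choice. The backup contains the private key, so treat it accordingly and remove it once the new certificate is in place.

```shell
#!/usr/bin/env bash
# Sketch: save a TLS secret to a file, then delete it to force re-issuance.
# Usage: backup_and_delete_secret <tls-secret-name> <namespace> [backup-file]

backup_and_delete_secret() {
  local secret="$1" ns="$2"
  local backup="${3:-${ns}-${secret}.backup.yaml}"

  # Refuse to delete anything if the backup could not be written.
  if ! kubectl get secret "$secret" -n "$ns" -o yaml > "$backup"; then
    rm -f "$backup"
    echo "could not read secret ${ns}/${secret}; nothing deleted" >&2
    return 1
  fi

  kubectl delete secret "$secret" -n "$ns" >/dev/null || return 1
  echo "deleted ${ns}/${secret}; backup at ${backup}"
}
```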
cert-manager pods not running
- Check the HelmRelease:
flux get helmrelease cert-manager -n cert-manager
- If the release is in a failed state, suspend and resume:
flux suspend helmrelease cert-manager -n cert-manager
flux resume helmrelease cert-manager -n cert-manager
- Force reconciliation:
flux reconcile helmrelease cert-manager -n cert-manager --with-source
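Once the HelmRelease has reconciled, the deployments can be checked in one pass. A minimal sketch, assuming bash; `check_cert_manager_rollout` is a hypothetical helper name. The `cert-manager` and `cert-manager-webhook` deployment names come from the commands earlier in this runbook, while `cert-manager-cainjector` is an assumption based on the default Helm chart naming.

```shell
#!/usr/bin/env bash
# Sketch: report rollout state for each cert-manager deployment.
# Usage: check_cert_manager_rollout [namespace] [timeout]
# Returns non-zero if any deployment has not finished rolling out.

check_cert_manager_rollout() {
  local ns="${1:-cert-manager}" timeout="${2:-120s}" failed=0 deploy

  for deploy in cert-manager cert-manager-webhook cert-manager-cainjector; do
    if kubectl rollout status "deployment/${deploy}" \
         -n "$ns" --timeout="$timeout" >/dev/null 2>&1; then
      echo "${deploy}: rolled out"
    else
      echo "${deploy}: NOT ready"
      failed=1
    fi
  done
  return "$failed"
}
```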
Prevention
- Monitor the certmanager_certificate_expiration_timestamp_seconds metric in Prometheus
- Set alerting thresholds at 30 days (warning) and 7 days (critical) before expiry
- Ensure cert-manager has sufficient RBAC to create/update secrets in all namespaces
- Test certificate renewal in a staging environment before production
- Keep cert-manager updated via the Flux HelmRelease version pin (currently 1.14.4)
- Verify ClusterIssuer health after any cert-manager upgrade
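The 30-day and 7-day thresholds translate into PromQL by comparing the expiry timestamp with the current time. A minimal sketch, assuming bash; `expiry_query` is a hypothetical helper name and `PROM_URL` in the usage comment is a placeholder for the Prometheus base URL. It illustrates one possible expression and is not necessarily the rule the existing alerts use.

```shell
#!/usr/bin/env bash
# Sketch: build a PromQL expression matching certificates that expire
# in fewer than N days, using the cert-manager expiry metric.
# Usage against the Prometheus HTTP API (PROM_URL is a placeholder):
#   curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$(expiry_query 7)"

expiry_query() {
  local days="$1"
  # The metric is an epoch timestamp, so subtracting time() yields seconds left.
  printf '(certmanager_certificate_expiration_timestamp_seconds - time()) < %d\n' \
    "$(( days * 86400 ))"
}
```

With these helpers, `expiry_query 30` corresponds to the warning threshold and `expiry_query 7` to the critical one.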
Escalation
- If certificate issuance fails after multiple retries: check upstream issuer (Let's Encrypt rate limits, internal CA health)
- If cert-manager pods are crash-looping: escalate to platform team lead
- If the issue affects Istio gateway TLS termination: this is a P1 incident affecting all external traffic